Claude Mythos Preview

93.9% on SWE-bench Verified

!?

Released Apr 7, 2026

SWE-bench Goes Live! (arXiv:2505.23419v1 [cs.SE], 29 May 2025)

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

GraphWalks BFS

https://llm-stats.com/benchmarks/graphwalks-bfs-%3E128k